Captions provided by @chaselfrazier and @whitecoatcapxg.
"OpenStreetMap US by the Numbers, for the Community." By: Mikel Maron.
>> Everybody had their coffee? We're going to go through a lot of material, very fast. Pay attention. My name is Mikel Maron, and I'm on the data team at Mapbox. I'm sharing this presentation with Jennings, a Ph.D. student at CU Boulder, who we've had the pleasure of having as a research fellow with us at Mapbox for the last couple of months, digging into OpenStreetMap. And I was really thrilled with Katherine's talk this morning, with the vision and the kinds of things we need to be thinking about as a community going forward. And in order to know where we're going to go -- it's kind of a cliché -- we need to know where we are. A lot of the time, I think, with the stories that we tell about our community and the way that we view our community, I'm not sure we exactly understand what's happening with OpenStreetMap, who we are, and what happens every day. The OpenStreetMap database is huge, and we've barely scratched the surface of analyzing and interpreting it to pull out interesting insights to help us tell stories which accurately represent who we are and where we want to go.
This is the kind of stat which we're very familiar with from the wiki page: it shows registrations nearing 3 million, and it does not really represent OpenStreetMap well at all. It's very impressive. Look how much we're growing. But we're actually more around 15,000 users editing every week. That's also impressive. We have 3.5 billion nodes in the OpenStreetMap database. What does that mean? Are those good nodes? Are they bad nodes? Where are they? And, yeah, there are definitely a lot of great tools for analysis within the community and a lot of great work. This is one of the applications for looking at these kinds of statistics in more detail, with things like an overview of other users around you.
Seeing OpenStreetMap contributors is really helpful to get a picture of your community. But I feel like there's more to do to make these tools of analysis available to everybody. OpenStreetMap is a global project, but what's really important is very, very local. And in order to tell stories that are very, very local, about very local issues in local data, everybody needs to have access to analysis tools in order to derive those insights. Tools need to be easy to access.
And why do we want these numbers and stories? Well, we want to know where we can improve the map more, where the community's doing well, and where it can use more help. And as we go out and improve the map and improve the community, we want to measure ourselves. With something like a mapping project, it takes a lot of work to analyze: how much impact was there? How much data was created? What was the quality? What was the following impact on the community? And you want these kinds of numbers for questions like: when do mappers come onboard and get very, very excited and very, very hooked into OpenStreetMap? Say, here in Seattle, the Seattle community would like to know: who are these folks, and how can I reach them? Can you alert me immediately when someone gets hooked, so that I can reach out and bring them in, and tell them about State of the Map U.S. and all the events that might be happening? Or someone who's been inactive and has dropped off of OpenStreetMap: how do you inspire them to get involved again?
So, numbers. We have a lot of numbers. We focused just on the U.S. in this analysis. And here's a similar graph to what we had before. We have going on 300 million edits in OpenStreetMap in the U.S. Okay. That sounds like a lot. [Applause] Yeah. We have 300 million edits in OpenStreetMap. Yes. But where are they? What are they? What does this actually tell us about the quality of the map and about where we need to do more work? This is the number of active users per week.
And I was actually pretty pleased and surprised about this. Nearly 1,000 people every week edit OpenStreetMap just in the U.S. I think that's incredible. Over all of time, since the beginning of the project, nearly 70,000 people have edited something in OpenStreetMap. So 1,000 are active every week; what about the other 69,000 editors? What happened to them? How can we reach them? Breaking it down by city, it starts to get interesting. Hopefully this starts to, like, trigger some inter-metro competition. You can see San Francisco, New York, Los Angeles, Washington, and Seattle, respectably at number five in terms of number of active users.
And then we started to look a little bit at some of the features in particular. This is a graph, per week, of kilometers of roads edited, starting in 2006, and you can see the evidence of the -- the other spikes are subsequent, like the street bots and stuff like that. But then there's the lower line of more human editing. And one thing that has been interesting, I think building a little bit off of what Alan McConchie's talk earlier was examining: the blue line is the kilometers of roads that have just been created in OpenStreetMap. The green line is the roads which have subsequently been edited by someone else. So they represent another edit after the initial creation in OpenStreetMap. So you see in 2008 there's just a steady climb of new roads, but not so much since then. But some time in 2013 we actually had more edits on roads than original creation of roads. And I think this says something about, like, the health and the quality of the community, when things have been edited again and again and again. That means you can trust the quality more when more people have looked at something and chosen to correct it. Now, this breaks it down by metro, and I think it unsurprisingly correlates very well with the size and the sprawl of the city.
So Los Angeles is great: they have a lot of roads, they have a lot of sprawl. Yeah, a lot of roads to map in Los Angeles. Seattle is much more compact and dense, so you see overall there are fewer roads mapped in Seattle.
Now, buildings are interesting because they have this echo effect from the initial structure of the road network. Once the road network is in place, and we have reasonable confidence in it, that's when we start to see buildings, and it's much more spiky because there have been a number of building imports since 2009. And it will be interesting to investigate further: what do these spikes represent? What are the activities? What has actually happened in the community that has caused this kind of spiky growth? And it becomes very clear when you break it down by metro region. Los Angeles just in 2016 has surged to be the most densely mapped in terms of buildings, with the L.A. building import. But you can see buildings have been very much a function of imports, and you can see when various communities have taken on building imports. I live in Washington, D.C. Interesting. Well, we're not even in the top here. But in some of them, like Boston, several stages of imports were happening.
And unlike roads, there are ten times as many new buildings being added as have been edited. So if you go back to roads, we're nearly ten -- yeah. I think this shows it. This is the ratio of annually created kilometers of road versus edited, and likewise for the number of buildings. And you can see the ratio is much higher for annually created roads at first, that's the green line, and then it quickly comes down to the point where we have, at this point, ten times as many edits to existing roads as creation of new roads. For buildings, there are ten times as many new buildings added to OpenStreetMap as are edited. So what's the future for buildings? Do we expect this just to be the pattern over time?
Is there something about the topological structure of roads which means they're likely to be edited more than buildings will be? Or will we see eventually, like, a downward trend on buildings as the map matures and we add more and more detail to buildings?
Our approach: tiles are the unit of analysis. Tools understand tiles for map display; we use tiles for analysis. We want to take an approach which uses open source tools, which uses tools that are easily available, which doesn't require building big databases with OpenStreetMap data. So OSM QA Tiles are vector tiles containing all of OpenStreetMap, preprocessed every day, and you can download and access the whole world. You can get country extracts and historic extracts. We use this at Mapbox: in November we did a road coverage comparison globally, comparing miles of road against the CIA World Factbook, to try to get a sense of how high quality and accurate OpenStreetMap is. So according to that, we are well beyond what the CIA book says we have. Well over 100% of the roads are covered. But that's not true. There are probably more roads that we have to edit. So how can we discover intrinsic properties of the data which reveal quality? We're going to be the biggest database of open spatial data, and the most current, and have nothing external to compare to. How do we evaluate when we have good quality data? We have to look at the structure of the data itself.
The open source tools are OSM QA Tiles and TileReduce. This is just a little bit of code, not to go into, but just to show that it's small. So to write these kinds of analysis scripts is very simple. A few lines of code, it runs insanely, insanely fast, and all the analysis is generated incredibly speedily. This approach has been incorporated in OSM Analytics, a dashboard view on this kind of analysis. Definitely, if you haven't seen it, look at it, and we'll have to walk you through it because it's not entirely obvious.
But you select an area, and you can see over time these kinds of graphs for a well-defined area. So this is Seattle; on the bottom it shows very particularly when edits happened. And it does things like this: let's swipe and see how many were there before and after. This was some place in Ecuador, showing the number of buildings that were added after the recent disasters there.
Okay. So, looking at how we can determine things like quality, coverage, and community within the data. We have a couple of different kinds of metrics that you can think about when looking at quality. You can take a data approach, like the OSM tools that check the data for known issues. You can compare to known sources, like the Factbook, and see the kinds of coverage we need. You can do full qualitative analysis and sit down and look at every single tile and say, that's good, that's interesting. And that might take a while. But what none of those kinds of metrics take into account is the user. The extra piece with OSM spatial data is that it's contributed by people. There's a user account, there's a time, there's this kind of extra component of the data that makes it distinct from other geospatial data. So we want to incorporate this aspect, the user aspect, into the data, to learn about the process of creation and what result that has on the quality of the data that we see.
So one approach that we've been taking: we've been doing a lot with the tile framework and then a little bit with Postgres and Quickbook to see what the features are. So this is one of the first visualizations that we put together. As you can see here, there's this heat map of the U.S. through the last 12 years, and this is looking at the density of editors. So this is a simple heat map going from white to yellow to red based on the number of active editors on a given tile. And you can see all of these tiles have at least three editors active on them. So you see the hot spots in the denser areas.
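
A minimal sketch of the editor-density count behind a heat map like this one. It is not the code used in the talk: it assumes edits are available as hypothetical `(user, lon, lat)` records, bins them into Web Mercator tiles with the standard slippy-map formula, and uses zoom 12 as an assumed tile size. The `min_editors=3` default mirrors the "at least three editors" cutoff mentioned above.

```python
import math
from collections import defaultdict

ZOOM = 12  # assumption: the talk does not say which zoom level the heat map uses


def lonlat_to_tile(lon, lat, zoom=ZOOM):
    """Return the (x, y) slippy-map tile indices containing a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def editors_per_tile(edits, min_editors=3):
    """Count distinct users per tile; keep tiles with at least min_editors.

    edits: iterable of (user, lon, lat) tuples (a hypothetical input format).
    """
    users_by_tile = defaultdict(set)
    for user, lon, lat in edits:
        users_by_tile[lonlat_to_tile(lon, lat)].add(user)
    return {
        tile: len(users)
        for tile, users in users_by_tile.items()
        if len(users) >= min_editors
    }
```

Counting distinct users with a set, rather than counting edits, is what makes this a measure of editor density: one very active account on a tile still counts once.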
And in there you see these road networks where people are editing roads and highways. So just from this high-level overview you get this pulse of what the contribution activity looks like and where users are focusing their time. So that's the where.
And then this is the when. This is showing when the edits are happening on the map within a given year. It's going from light blue to red depending on the month of the year. So you can see in 2007 you have the TIGER import there, and you notice it misses Texas, and then Texas comes in 2008. And then these areas that are remaining blue have not been edited later in the year. They were touched in the beginning of the year and then not edited later. So this gives you an idea of the staleness, if you will, of data through any given year. And this gives us another kind of view to investigate. What does that mean? We don't want to say that because something hasn't been touched, it's bad quality per se. But it's interesting that it shows us what the contribution activity looks like. So now we want to dive into that a little further. We can use that to launch us off into some stories, looking at how the editing is taking place.
So first let's take a look at the road story in the U.S. To do that, we've gone through and extracted edit histories for every OSM contributor through all of time. And then we've built some networks based on where users are editing on the same tile. So here's the 2007 network for the roads in the U.S. What we have here is the Dave Hansen TIGER account, and then all the other nodes in the network, the blue ones around it, are the editors that have edited on the same tile as the Dave Hansen TIGER account. So most of the road network is going to have the road structure starting with this account, and then we can start to step that up. And now we go to 2008, and we see that that network is branching out.
So we see that people are editing these roads and building off of each other's edits. So we see this growing structure there. And that's at 25 kilometers of roads. And if we bump it up to a minimum of 100 kilometers of roads per Z12-level tile, we see that we still have the network there in the bottom left. And notice this other kind of network emerge: that's the San Francisco network in 2008. So these are users that are then editing in the same tiles. So we get this nice co-editing network that comes out, which we want to explore more. So that's looking at roads.
Let's look at buildings. So here's 2009, starting in the Bay Area again. These are edits of more than 40 buildings on a given tile: both users have edited at least 40 buildings on the given tile. And you see a different structure emerging. You see the smaller dual networks popping up, smaller dyads. And then let's step into 2010. Always my favorite in the OSM data: 2010, the building network in North America and Central America. We should not be surprised here to see this giant clump, this big cluster here in Haiti, of course. This is the response to the 2010 Haiti earthquake, where all of these people are editing together in that region. So that's looking at 40 buildings.
So let's go ahead and step into an example of trying to explore what this might look like at more of a local level. So in 2012, let's look at one of these smaller ones, a clique here of four users active in the Mississippi area. Let's explore some of their individual edits. Exploring these users, we found this first user has about 1,000 edits here on the map, and you see that 59% of their edits are happening on that single tile. The rest are in the surrounding tiles. So we have this user that's made over 1,000 edits to the map at this one point, 1,800 total. But they're mostly focused on this area, which is very interesting. What does that tell us about that user in the area? What about this user?
700 edits total, and 70% of them are happening on that one tile. So you have more of this central focus area. And then we have this user who is also active here, with 320 edits there on that tile. But it's only 4% of their edits. This user has 7,000 edits total -- 7,700. So if we look at the majority of their edits, it's actually in Germany: one tile with 21% of their edits, which I think is really interesting. They've done some editing in the U.S., and we go and look at the account, and we find out they're very active in the OpenRailwayMap project, so they also happen to be over Memphis. So who are the editors? That's interesting: we have this German editor mapping Memphis.
Looking at the surrounding tiles, back in Memphis now, we can find the top contributors. So here's this one user that has three edits on this tile, and it's always fun to find few edits on a given tile, because then we look at all their edits and you see, wow, this user is very active in the South, with a lot of automated edits throughout the Deep South here. Even though this user has 70,000 edits in this whole area, when we look at their focus area, we find that they are focusing -- these are just the tiles, out of all the tiles they've edited, where they have more than 5% of their activity. Even though they're spread out, they have this central focus, which I think is interesting. You find this across all users, or most users -- every user we've looked at.
So let's look at the Seattle story. Here's the Seattle building network. This is all the contributors editing over 100 buildings. And we shouldn't be too surprised that it's going to be the Seattle import at the center, and it's going to spread out. And the other two clusters are actually, I want to say, Chicago and New York, which is very interesting, that those happen to be connected. So these are users that are involved in the core processes and going around making edits on all of these tiles.
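
The co-editing networks described in this part of the talk can be sketched roughly as follows. This is an illustration under stated assumptions, not the speakers' implementation: the input is a hypothetical mapping from `(tile, user)` to the number of features (for example buildings) that user edited on that tile, and two users are linked when both reach the threshold on the same tile. The default of 40 follows the building example; whether the cutoff was "at least" or "more than" 40 is not clear from the talk, so `>=` is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def coediting_edges(counts, min_features=40):
    """Build weighted co-editing edges from per-tile, per-user feature counts.

    counts: dict mapping (tile, user) -> number of features edited
            (a hypothetical input format).
    Returns a dict mapping (user_a, user_b) -> number of tiles on which
    both users edited at least min_features features.
    """
    # Users who meet the threshold, grouped by tile.
    qualified = defaultdict(set)
    for (tile, user), n in counts.items():
        if n >= min_features:
            qualified[tile].add(user)

    # Every pair of qualifying users on a tile gets (or strengthens) an edge.
    edges = defaultdict(int)
    for users in qualified.values():
        for a, b in combinations(sorted(users), 2):
            edges[(a, b)] += 1
    return dict(edges)
```

Raising `min_features` thins the network in the same way the talk steps from 25 to 100 kilometers of road per tile: only the heavier contributors on each tile stay connected.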
So this opens up a lot of potential for how we can start to explore this data and take a look at who the contributors are, what kinds of mapping practices there are, what we can learn about how these practices are taking place, and what the resulting map looks like. So to measure more in depth, we want to keep building on this idea of map gardening and looking at, you know, new versus edited, et cetera, and ratios of density, the number of buildings and number of features compared to users. So, again, with all of these new kinds of metrics we're looking at quality and coverage by taking into account not just the features themselves but the metadata of the user.
>> So I know that was a lot. We have one thing for you to do yourselves. We want to connect more demographic information about our community with the kinds of edits we're making. So this is a small survey, the OpenStreetMap U.S. census. We're not releasing any identifiable information, but looking in aggregate at information about our community and how that correlates with editing behavior is the goal. And we want to work together with everyone else who's interested in these kinds of questions, whether you want to help develop or whether you just want to ask really interesting questions. There are lots of talks happening about analysis. Sterling is up in six minutes, and tomorrow afternoon at 4:00 p.m. there's another series of analysis talks, which are going to be great. Today at 4:45 in 107, we're going to have a birds of a feather session to talk, at a little bit slower pace, about the analysis work that we've been doing, and we want to hear what everyone else has been working on and what we can do together. And on Monday, we want to have a code sprint during the hands-on day focused on analysis tools.
That link will go to all of the interactive maps that Jennings has shown, so you can step through all the way back to 2006 and see what you were editing back then and who else you were editing with. That's a lot of fun. That is all. Thank you very much. [Applause] Maybe we have time for one question.
>> Have you tried normalizing any of these figures with population or any other denominator?
>> That's our next step.
>> Yeah. We haven't looked at any others, but I think one thing would be WorldPop: getting population estimates, getting that down to a tile level, and then using that to ask how it corresponds to a baseline of the amount of edits that you want to see for OpenStreetMap quality. And that would vary quite a bit, of course, depending on country and on urban versus rural. But, yeah, I think that should be interesting. One more.
>> Could you put that slide up for the survey?
>> Sure.
>> My fingers don't type that fast.
>> We'll tweet it out too.
>> Okay.
>> Thanks very much. [Applause]